---
title: "Jobs II Data Analysis: Job Training Efficiency for the Unemployed"
output:
flexdashboard::flex_dashboard:
vertical_layout: scroll
source_code: embed
---
```{r setup, include=FALSE, warning=FALSE}
#include=FALSE will not include r code in output
#warning=FALSE will remove any warnings from output
library(ggplot2)
library(flexdashboard)
library(tidyverse)
library(GGally)
library(dplyr)
library(caret) #for logistic regression
library(broom)#for tidy() function
library(png)
library(imager)
library(graphics)
library(jpeg)
library(Hmisc)
```
```{r load_data}
#Read in Data
df <- read_csv("C:/Users/Tates/Desktop/3200Project/JobsIICleanData.csv")
#Change data types for summary stats
df$Marital_Status <- as.factor(df$Marital_Status)
df$Employment <- as.factor(df$Employment)
df$White_Nonwhite <- as.factor(df$White_Nonwhite)
df$Education <- as.factor(df$Education)
df$Treatment <- as.factor(df$Treatment)
df$Sex <- as.factor(df$Sex)
```
# Introduction {data-orientation="rows"}
## Row {data-height="250"}
### Overview
For this project, we will follow the DCOVAC process. The process is listed below:
DCOVAC -- THE DATA MODELING FRAMEWORK
- DEFINE the Problem
- COLLECT the Data from Appropriate Sources
- ORGANIZE the Data Collected
- VISUALIZE the Data by Developing Charts
- ANALYZE the data with Appropriate Statistical Methods
- COMMUNICATE your Results
## Row {data-height="650"}
### The Problem & Data Collection
#### The Problem
Unemployment is a persistent issue in developed societies. Various organizations have created programs that provide unemployed people with resources for finding employment. This analysis aims to evaluate the effectiveness of techniques used to combat unemployment and to examine the demographic features of the unemployed population.
#### The Data
The Jobs II data come from a job-search intervention study investigating the efficacy of a job training intervention for the unemployed. The program is designed to increase reemployment among the unemployed and to enhance the mental health of job seekers. During the study, subjects participated in job-skills workshops that taught skills for finding a new job and ways to handle setbacks in the employment process. The original data set contains 899 rows and 17 columns.
#### Data Sources
Vinokur, A. and Schul, Y. (1997). Mastery and inoculation against setbacks as active ingredients in the jobs intervention for the unemployed. Journal of Consulting and Clinical Psychology 65(5):867-77.
### The Data
VARIABLES TO PREDICT WITH
- *Treatment*: Indicator variable for whether participant was randomly selected for the JOBS II training program. 1 = assignment to participation.
- *Econ_Hardship*: Level of pre-treatment economic hardship determined with a pre-screening questionnaire. Continuous values from 1 to 5, 5 being the highest level of economic hardship.
- *Age*: Age of participant during pre-screening questionnaire. Continuous values based on the day, month, and year.
- *Job_Seek*: Measure of job-search self-efficacy shown by the participant during the study. Continuous values from 1 to 5, 5 being the highest level of self-efficacy.
- *Marital_Status*: Marital status of the participant during the pre-screening questionnaire. 5 categories: married, never_married, separated, divorced, widowed.
- *White_Nonwhite*: Whether the participant is white or of a different race. 2 categories: white, nonwhite.
- *Education*: Level of previous education completed during pre-screening questionnaire. 5 categories: graduate_work, some_college, bachelors_degree, highschool_degree, some_highschool.
- *Sex*: Sex of the participant. 2 Categories: female, male.
VARIABLES WE WANT TO PREDICT
- *Employment*: If the participant gained employment after the study. Assessed in a follow-up interview. 2 categories: employed, unemployed
- *Depression*: Measure of depressive symptoms pre-treatment determined with a pre-screening questionnaire. Continuous values from 1 to 3, 3 being the highest level of depressive symptoms.
# Data
## Column {data-width="650"}
### Organize the Data
After organization, the clean data set contains 199 rows and 10 columns with no missing values. Some variables and rows were removed from the original data to improve readability and the efficiency of the predictive models.
```{r, cache=TRUE}
#the cache=TRUE can be removed. This will allow you to rerun your code without it having to run EVERYTHING from scratch every time. If the output seems to not reflect new updates, you can choose Knit, Clear Knitr cache to fix.
#View data
print(summary(df))
```
From this data we can see that our variables have a variety of different values based on their types. The summary statistics of the organized data set give insight on the participants of the study and the general unemployed population.
Observations Include:
1. Over two-thirds of the participants in the data set were assigned to be in the study from an outside source and did not seek employment assistance on their own.
2. Most participants experienced levels of economic hardship above the average citizen.
3. Most participants experienced depressive symptoms prior to the study.
4. The average participant age was 38 years old.
5. Most participants showed high job-search self-efficacy.
6. The most common marital status of participants was married.
7. Over three-fourths of the participants were white.
8. The most common education level of participants was some college. Participants with a high school degree closely followed.
9. The study included a similar number of male and female participants.
10. Less than one-third of participants gained employment after the study.
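Observation 10 can be verified directly from the employment counts in the summary above (61 employed, 138 unemployed):

```{r}
# Quick check of observation 10: share of participants employed after the study.
employed <- 61
total <- 61 + 138           # employed + unemployed counts from the summary
round(employed / total, 3)  # 0.307, just under one-third
```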
## Column {data-width="350"}
# Data Visualization: Average Depression by Education Level
## Column {data-width="650"}
How does the average depression level vary by education level?
To determine whether education group has a significant impact on depression level, a chart displaying the average depression level for each education group was created.
```{r, cache=TRUE}
average_depression <- aggregate(Depression ~ Education, df, FUN = mean)
```
```{r, cache=TRUE}
barplot(average_depression$Depression, names.arg = average_depression$Education,
xlab = "Education", ylab = "Average Depression Level",
main = "Average Depression Level by Level of Education Completed",
col = "blue", ylim = c(0, max(average_depression$Depression) * 1.2))
```
We can see that the average depression level after the study is similar across all categories of education. The some_highschool and graduate_work categories had the highest average reported depression levels after the study was complete. This bar chart suggests that a participant's completed level of education will not have a significant influence on the predicted depression level. However, participants who completed some high school education or graduate work may show higher depression levels.
# Data Visualization: Distributions and Count
## Row {data-width="500"}
##### To understand which variables will be significant for predictive modeling, distributions of the continuous variables and counts of the categorical variables were analyzed.
## Row
### Continuous Predictor Variables

We can see from the histograms that the distribution of age is fairly spread out, concentrated between 25 and 45 years of age. The distribution of economic hardship levels is concentrated in the middle. The distribution of job-search self-efficacy is skewed right, telling us most participants showed strong efforts to find employment.
## Row
### Categorical Predictor Variables

We can see from the counts that sex and education level will most likely not have a significant impact on depression level, as participants are fairly evenly distributed across those categories.
## Row
### Response Variables

We can see that the distribution of the depression response variable is skewed left. The employment counts show that only about one-third of participants gained employment, meaning participants were more likely to remain unemployed after the study was complete.
# Depression Analysis {data-orientation="rows"}
## Row
### Predict Depression Level
For this analysis we will use a Linear Regression Model.
```{r, include=FALSE, cache=TRUE}
#the include=FALSE hides the output - remove to see
Depressionlm <- lm(Depression ~ . ,data = df)
summary(Depressionlm)
```
```{r, include=FALSE, cache=TRUE}
#the include=FALSE hides the output - remove to see
tidy(Depressionlm)
```
### Adjusted R-Squared
```{r, cache=TRUE}
ARSq<-round(summary(Depressionlm)$adj.r.squared,2)
valueBox(paste0(ARSq*100, '%'), icon = "fa-thumbs-up")
```
### RMSE
```{r, cache=TRUE}
Sig<-round(summary(Depressionlm)$sigma,2)
valueBox(Sig, icon = "fa-thumbs-up")
```
## Row
### Regression Output
```{r,include=FALSE, cache=TRUE}
knitr::kable(summary(Depressionlm)$coef, digits = 3) #pretty table output
summary(Depressionlm)$coef
```
```{r, cache=TRUE}
# this version sorts the p-values (it is using an index to reorder the coefficients)
idx <- order(coef(summary(Depressionlm))[,4])
out <- coef(summary(Depressionlm))[idx,]
knitr::kable(out, digits = 3) #pretty table output
```
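The coefficient labels above fuse a factor's column name with one of its levels (for example `Sexmale` or `Educationsome_college`): `lm()` dummy-codes each factor, treating the alphabetically first level as the reference absorbed into the intercept. A minimal sketch of the expansion, using a toy factor:

```{r}
# R expands a factor into 0/1 indicator columns; the first level
# (alphabetically, here "female") is the reference and gets no column.
toy <- data.frame(Sex = factor(c("female", "male")))
colnames(model.matrix(~ Sex, toy))   # "(Intercept)" "Sexmale"
```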
### Residual Assumptions Explorations
```{r, cache=TRUE}
plot(Depressionlm, which=c(1,2)) #which tells which plots to show (1-6 different plots)
```
## Row
### Analysis Summary
After examining this model, we determine that some predictors are not important for predicting depression level, so a pruned version of the model is created by removing the predictors that are not significant.
## Row
### Predict Depression Level: Final Version
For this analysis we will use a pruned linear regression model, with Treatment and Marital_Status removed.
```{r, include=FALSE, cache=TRUE}
#the include=FALSE hides the output - remove to see
Depressionlm2 <- lm(Depression ~ . -Treatment -Marital_Status,data = df)
summary(Depressionlm2)
```
```{r, include=FALSE, cache=TRUE}
#the include=FALSE hides the output - remove to see
tidy(Depressionlm2)
```
### Adjusted R-Squared
```{r, cache=TRUE}
ARSq<-round(summary(Depressionlm2)$adj.r.squared,2)
valueBox(paste0(ARSq*100, '%'), icon = "fa-thumbs-up")
```
### RMSE
```{r, cache=TRUE}
Sig<-round(summary(Depressionlm2)$sigma,2)
valueBox(Sig, icon = "fa-thumbs-up")
```
## Row
### Regression Output
```{r, include=FALSE, cache=TRUE}
knitr::kable(summary(Depressionlm2)$coef, digits = 3) #pretty table output
```
```{r, cache=TRUE}
# this version sorts the p-values (it is using an index to reorder the coefficients)
idx <- order(coef(summary(Depressionlm2))[,4])
out <- coef(summary(Depressionlm2))[idx,]
knitr::kable(out, digits = 3) #pretty table output
```
### Residual Assumptions Explorations
```{r, cache=TRUE}
plot(Depressionlm2, which=c(1,2)) #which tells which plots to show (1-6 different plots)
```
## Row
### Analysis Summary
Reducing the predictors did help with the prediction of depression levels: the adjusted R-squared increased, showing an improvement in the fit of the model.
The model shows that economic hardship, age, and sex were all significant variables for predicting depression levels.
The following table shows the direction of each predictor's effect on the predicted depression level.
```{r, cache=TRUE}
#create table summary of predictor changes
predchang <- tibble(
  Variable = c('Econ_Hardship', 'Sex(male)', 'Age', 'White_Nonwhite(white)', 'Education(some_college)', 'Employment(unemployed)', 'Education(graduate_work)', 'Job_Seek', 'Education(some_highschool)', 'Education(highschool_degree)'),
Direction = c('Increase','Decrease','Decrease','Increase', 'Decrease','Increase','Increase', 'Decrease','Increase', 'Decrease')
)
knitr::kable(predchang) #pretty table output
```
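The direction column above was filled in by hand from the signs of the fitted coefficients; the same table could be derived programmatically. A sketch using R's built-in `cars` data as a stand-in for the depression model:

```{r}
# Direction of each predictor's effect = sign of its fitted coefficient.
fit_demo <- lm(dist ~ speed, data = cars)  # stand-in model on a built-in data set
coefs <- coef(fit_demo)[-1]                # drop the intercept
ifelse(coefs > 0, "Increase", "Decrease")
```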
# Employment Analysis {data-width="500"}
## Row {data-width="500"}
### Predict Employment with a Neural Network Model
{width="304"} {width="300"}
## Row
#### Analysis
The neural network model has a fairly high misclassification rate. With over one-fourth of the data misclassified for the validation and testing set, a neural network model is not the best fit for this set of data.
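The misclassification rate reported above (the neural network itself was fit outside of R, as the screenshots show) is simply the share of predictions that disagree with the actual class; a minimal sketch with hypothetical labels:

```{r}
# Misclassification rate: proportion of predicted classes that differ
# from the actual classes. These labels are hypothetical illustrations.
actual    <- c("employed", "unemployed", "unemployed", "employed")
predicted <- c("employed", "employed",   "unemployed", "employed")
mean(predicted != actual)   # 0.25, i.e. one of four misclassified
```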
## Row {data-width="500"}
### Predict Employment with a Boosted Tree Model
{width="335"}
{width="345"}
## Row
#### Analysis
The boosted tree model has a similar misclassification rate but a higher R-squared value. Each variable is used in a significant number of splits, so the variables will not be pruned for this model. The column contribution table shows that sex, age, and job_seek were the most significant variables for prediction in this model. A boosted tree model is more effective for this analysis than a neural network: its R-squared is higher, so it offers a greater degree of explanation.
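The column contributions referenced above come from output produced outside of R; as a stand-in illustration of how split-based variable importance is read, here is a single classification tree fit with the `rpart` package on R's built-in `iris` data (a boosted tree aggregates many such trees, so its contribution table is interpreted the same way):

```{r}
# Stand-in sketch: variable importance from a single classification tree,
# analogous to the column-contribution table of the boosted model.
library(rpart)
tree_demo <- rpart(Species ~ ., data = iris, method = "class")
tree_demo$variable.importance   # larger value = bigger contribution to splits
```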